First of all, I want to check the downloaded data…
dim(df)
## [1] 1599 13
check for missing values…
anyNA(df)
## [1] FALSE
check for out of scale quality values…
any(df$quality > 10 | df$quality < 0)
## [1] FALSE
check for negative values…
any(df < 0)
## [1] FALSE
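The three checks above can also be bundled into one helper. This is only a sketch: `toy` is built from a few of the first rows shown in the str() output further down, and `check_wine_data` is a hypothetical name that is not part of the original analysis. Note the element-wise `|`/`&` operators: the scalar `||` only accepts length-one conditions.

```r
# Toy data: first five rows of three columns, copied from the str() output.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Hypothetical helper: returns one named logical per sanity check.
check_wine_data <- function(d) {
  c(
    has_na           = anyNA(d),                                # any missing value?
    quality_in_range = all(d$quality >= 0 & d$quality <= 10),   # element-wise &
    has_negative     = any(d < 0)                               # any negative cell?
  )
}

checks <- check_wine_data(toy)
print(checks)
```

On the full data the same call would be `check_wine_data(df)`.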
Once checked let’s start with the analysis:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
There are 1599 wine observations with 13 variables each, with the following distribution:
X is the observation index (wine)
quality is the experts’ quality rating (dependent variable)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The general summary shows some interesting values to consider:
The median quality is 6 on a 0 to 10 scale…
The maximum values of total.sulfur.dioxide, volatile.acidity, citric.acid, chlorides, residual.sugar and free.sulfur.dioxide are roughly 3 to 8 times their respective medians
Given these extreme values, it will be interesting to see the dispersion of the chemical properties:
## fixed.acidity volatile.acidity citric.acid
## 1.741096318 0.179059704 0.194801137
## residual.sugar chlorides free.sulfur.dioxide
## 1.409928060 0.047065302 10.460156970
## total.sulfur.dioxide density pH
## 32.895324478 0.001887334 0.154386465
## sulphates alcohol quality
## 0.169506980 1.065667582 0.807569440
The dispersion of total.sulfur.dioxide, free.sulfur.dioxide and residual.sugar is high relative to their means, as we suspected above.
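Standard deviations on different scales are hard to compare directly. One way to make the comparison scale-free is the coefficient of variation (sd divided by mean). The sketch below is an illustration rather than part of the original analysis; it uses only the means and standard deviations printed above, and `stats` is a name chosen for this example.

```r
# Means from the summary output and standard deviations from the dispersion output.
stats <- data.frame(
  variable = c("total.sulfur.dioxide", "free.sulfur.dioxide", "residual.sugar",
               "alcohol", "pH", "density"),
  mean = c(46.47, 15.87, 2.539, 10.42, 3.311, 0.9967),
  sd   = c(32.895324478, 10.460156970, 1.409928060,
           1.065667582, 0.154386465, 0.001887334)
)

# Coefficient of variation: dispersion relative to the mean.
stats$cv <- stats$sd / stats$mean

# Sort from most to least dispersed.
stats <- stats[order(stats$cv, decreasing = TRUE), ]
print(stats, row.names = FALSE)
```

With these figures the three sulfur and sugar variables have a coefficient of variation above 0.5, while pH and density stay well below 0.1.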
Let’s see the distribution of each variable
And the Normal Q-Q plot for each one
Looking at the plots, we can observe that several properties are right-skewed, while a few, such as pH, density and quality, seem normally distributed, although the density line in their histograms does not show it clearly.
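The plotting code is not shown in this report. A histogram with a density line and a Normal Q-Q plot of the kind discussed above could be drawn with base R roughly as follows. This is a sketch under assumptions: `plot_distribution` is a hypothetical helper name, and the sample is just the first ten alcohol values from the str() output, not the full data.

```r
# Toy sample: first ten alcohol values from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Hypothetical helper: histogram with density line next to a Normal Q-Q plot.
plot_distribution <- function(x, name) {
  op <- par(mfrow = c(1, 2))   # two panels side by side
  on.exit(par(op))             # restore graphics settings afterwards
  hist(x, freq = FALSE, main = paste("Histogram of", name), xlab = name)
  lines(density(x))            # kernel density estimate over the histogram
  qq <- qqnorm(x, main = paste("Normal Q-Q of", name))
  qqline(x)                    # reference line through the quartiles
  invisible(qq)                # theoretical and sample quantiles
}

qq <- plot_distribution(alcohol, "alcohol")
```

On the full data the helper could be applied to every numeric column, for example with `for (v in names(df)) plot_distribution(df[[v]], v)`.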
Let’s take a look at what happens if we change all the variables to a logarithmic scale…
Interesting… we can see that skewed variables like fixed.acidity, volatile.acidity, chlorides and sulphates move to a ‘more’ normal distribution. Let’s add them to the dataframe:
#adding transformed vars to the dataframe with the suffix '.log'
df$fixed.acidity.log <- log10(df$fixed.acidity)
df$volatile.acidity.log <- log10(df$volatile.acidity)
df$chlorides.log <- log10(df$chlorides)
df$sulphates.log <- log10(df$sulphates)
Drawing the Normal Q-Q plots again with the new log variables:
It seems that the new variables are closer to a normal distribution than the original ones.
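The visual impression from the Q-Q plots can be backed by a number. One option, which is an assumption of this sketch and not something the original analysis computed, is the moment-based sample skewness before and after the log10 transformation. The `skewness` function is defined here by hand, and the sample is the first ten chlorides values from the str() output.

```r
# Moment-based sample skewness (third central moment over variance^1.5).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy sample: first ten chlorides values from the str() output.
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076,
               0.075, 0.069, 0.065, 0.073, 0.071)

skew_raw <- skewness(chlorides)          # skewness on the original scale
skew_log <- skewness(log10(chlorides))   # skewness after the log10 transform

print(c(raw = skew_raw, log10 = skew_log))
```

A right-skewed sample has positive skewness, and the log transformation pulls it towards zero, which is what ‘more’ normal means here.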
Due to the importance of quality in this study, let’s take a deeper look at the quality histogram.
Some insights about quality:
Let’s test the normality of quality
##
## Shapiro-Wilk normality test
##
## data: df$quality
## W = 0.85759, p-value < 2.2e-16
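Well… looking at the results, the Shapiro-Wilk test rejects normality (p-value < 2.2e-16), so we can NOT claim that quality is normally distributed. This is expected for a discrete variable with a sample of this size, where even small deviations from normality are detected. Considering the shape of the histogram against the normal curve, and the Q-Q plot, I will treat quality as approximately normal for the rest of this exploration.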
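For reference, a test like the one reported above is run with `shapiro.test()` from base R's stats package. The sketch below applies it to the first ten quality values from the str() output only, so its W and p-value will differ from the figures printed above for the full data.

```r
# Toy sample: first ten quality values from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Shapiro-Wilk test; the null hypothesis is that the sample is normal.
res <- shapiro.test(quality)

# A small p-value means the null hypothesis of normality is rejected.
print(res$statistic)
print(res$p.value)
```

On the full data the call would be `shapiro.test(df$quality)`, matching the `data: df$quality` line in the output above.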
The main feature of interest is quality, as discussed here
At this point it is difficult to answer this question, but intuitively I suggest alcohol, the acidity-related features and sulfur dioxide, though I am not sure which of the two, free.sulfur.dioxide or total.sulfur.dioxide. I need a correlation test to answer this question with more confidence, which I’ll perform in the next section.
Yes, I did. I explain why in the next question. Additionally, I’m thinking of splitting quality into bad, medium and good wines.
As we can see in the Features Histograms & Normality section, citric.acid is the feature with an unusual distribution. residual.sugar and chlorides have a right-skewed distribution with a long, heavy tail, which suggests the existence of outliers.
I’ve applied a logarithmic transformation to fixed.acidity, volatile.acidity, chlorides and sulphates because of their right-skewed distributions; the transformation brings each variable closer to a normal distribution.
Let’s start with the correlation matrix applied to the dataframe with the new variables:
## citric.acid residual.sugar free.sulfur.dioxide
## citric.acid 1.00000000 0.14357716 -0.0609781292
## residual.sugar 0.14357716 1.00000000 0.1870489951
## free.sulfur.dioxide -0.06097813 0.18704900 1.0000000000
## total.sulfur.dioxide 0.03553302 0.20302788 0.6676664505
## density 0.36494718 0.35528337 -0.0219458312
## pH -0.54190414 -0.08565242 0.0703774985
## alcohol 0.10990325 0.04207544 -0.0694083536
## quality 0.22637251 0.01373164 -0.0506560572
## fixed.acidity.log 0.66716292 0.10927782 -0.1509186427
## volatile.acidity.log -0.56495716 0.01051618 -0.0001783224
## chlorides.log 0.18178017 0.10228456 -0.0021952453
## sulphates.log 0.33151619 0.01601568 0.0480522317
## total.sulfur.dioxide density pH
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## fixed.acidity.log -0.10460535 0.67477009 -0.70636020
## volatile.acidity.log 0.09281289 0.04542565 0.22311544
## chlorides.log 0.05837562 0.35193852 -0.28362873
## sulphates.log 0.01329585 0.16612354 -0.15411585
## alcohol quality fixed.acidity.log
## citric.acid 0.10990325 0.22637251 0.66716292
## residual.sugar 0.04207544 0.01373164 0.10927782
## free.sulfur.dioxide -0.06940835 -0.05065606 -0.15091864
## total.sulfur.dioxide -0.20565394 -0.18510029 -0.10460535
## density -0.49617977 -0.17491923 0.67477009
## pH 0.20563251 -0.05773139 -0.70636020
## alcohol 1.00000000 0.47616632 -0.09885158
## quality 0.47616632 1.00000000 0.11423756
## fixed.acidity.log -0.09885158 0.11423756 1.00000000
## volatile.acidity.log -0.22862294 -0.39124918 -0.26393947
## chlorides.log -0.30396099 -0.17613996 0.19892956
## sulphates.log 0.13515624 0.30864193 0.19790674
## volatile.acidity.log chlorides.log sulphates.log
## citric.acid -0.5649571634 0.181780174 0.33151619
## residual.sugar 0.0105161845 0.102284562 0.01601568
## free.sulfur.dioxide -0.0001783224 -0.002195245 0.04805223
## total.sulfur.dioxide 0.0928128858 0.058375623 0.01329585
## density 0.0454256502 0.351938519 0.16612354
## pH 0.2231154407 -0.283628734 -0.15411585
## alcohol -0.2286229380 -0.303960993 0.13515624
## quality -0.3912491821 -0.176139965 0.30864193
## fixed.acidity.log -0.2639394669 0.198929558 0.19790674
## volatile.acidity.log 1.0000000000 0.127885951 -0.29473814
## chlorides.log 0.1278859514 1.000000000 0.24307622
## sulphates.log -0.2947381357 0.243076224 1.00000000
Analyzing the correlations, we can observe that the properties most correlated with quality are fixed.acidity.log, volatile.acidity.log, citric.acid, chlorides.log, total.sulfur.dioxide, density, sulphates.log and alcohol.
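To make the ranking explicit, the quality column of the matrix above can be sorted by absolute value. The sketch below copies those eleven coefficients by hand; `cor_quality` and `ranked` are names chosen for this example. On the full data the same vector could presumably be obtained from the quality column of `cor()` applied to the numeric columns.

```r
# Correlations with quality, copied from the quality column of the matrix above.
cor_quality <- c(
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  alcohol              =  0.47616632,
  fixed.acidity.log    =  0.11423756,
  volatile.acidity.log = -0.39124918,
  chlorides.log        = -0.17613996,
  sulphates.log        =  0.30864193
)

# Rank by strength of association, ignoring the sign.
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(round(ranked, 3))
```

alcohol comes first, followed by volatile.acidity.log and sulphates.log, while residual.sugar, free.sulfur.dioxide and pH sit at the bottom.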
Let’s study the correlation with some boxplots:
Now I will create a new factor variable called quality.bucket, as I mentioned above, by cutting the original quality variable into bad, medium and good wines. The cutting levels will be:
bad if quality is <= 4
medium if quality is > 4 and <= 6
good if quality is > 6
# Create a new factor variable by cutting the original quality variable
dfCor$quality.bucket <- cut(dfCor$quality, breaks = c(0, 4, 6, 10), right = TRUE, labels = c("Bad", "Medium", "Good"))
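To see how these breaks behave at the boundaries, here is a small self-contained check of the same `cut()` call on one value per observed quality score (3 to 8, the range shown in the summary). The vector `quality` and the name `bucket` exist only in this sketch.

```r
# One value for each quality score that appears in the data (min 3, max 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Same breaks and labels as above; right = TRUE gives (0,4], (4,6], (6,10].
bucket <- cut(quality, breaks = c(0, 4, 6, 10), right = TRUE,
              labels = c("Bad", "Medium", "Good"))

print(data.frame(quality, bucket))
```

Because the intervals are closed on the right, a score of exactly 0 would fall outside the first interval and become NA; that does not matter here because the lowest observed score is 3.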
Let’s see the boxplots now:
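The plotting code is not included in this report. A boxplot of one property per quality bucket could be drawn with base R along these lines; this is a sketch on the first ten rows from the str() output, so the "Bad" bucket happens to be empty in this toy sample.

```r
# Toy data: first ten alcohol and quality values from the str() output.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy$quality.bucket <- cut(toy$quality, breaks = c(0, 4, 6, 10), right = TRUE,
                          labels = c("Bad", "Medium", "Good"))

# One box per bucket; boxplot() invisibly returns the statistics it draws.
bp <- boxplot(alcohol ~ quality.bucket, data = toy,
              xlab = "quality.bucket", ylab = "alcohol")

print(bp$n)   # number of observations in each bucket
```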
In this part of the study, I selected a few variables (fixed.acidity.log, volatile.acidity.log, citric.acid, chlorides.log, total.sulfur.dioxide, density, sulphates.log and alcohol) that, according to the correlation matrix, seem to be more correlated with the quality variable.
After that, I decided to plot each of the selected variables using boxplots, getting some interesting insights:
citric.acid, alcohol, fixed.acidity.log and sulphates.log have a positive relation with the quality of wine.
density, volatile.acidity.log and chlorides.log have a negative relation with the quality of wine.
total.sulfur.dioxide has no apparent relation.
There are several interesting relations that need to be observed:
pH is strongly correlated with the acidity-related properties such as citric.acid (negative) and fixed.acidity (negative).
density is correlated with citric.acid and residual.sugar, as well as with alcohol and fixed.acidity. The relation of density with residual.sugar (positive) and alcohol (negative) is expected, but not the one with citric.acid (positive) or fixed.acidity (positive).
quality is most strongly correlated with alcohol, closely followed by volatile.acidity.
Let’s see how these main features are related to each other, now grouped by quality.bucket:
Let’s take a closer look at how alcohol is related to the other main variables, grouped by quality.bucket:
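As an illustration of this kind of grouped plot, the sketch below draws alcohol against volatile.acidity and colours each point by its quality bucket, using base R graphics. The data are the first ten rows from the str() output, and the colour choices in `cols` are arbitrary assumptions of this example.

```r
# Toy data: first ten rows of three columns from the str() output.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
toy$quality.bucket <- cut(toy$quality, breaks = c(0, 4, 6, 10), right = TRUE,
                          labels = c("Bad", "Medium", "Good"))

# One colour per bucket (arbitrary choice for this sketch).
cols <- c(Bad = "red", Medium = "grey40", Good = "blue")

plot(toy$alcohol, toy$volatile.acidity,
     col = cols[as.character(toy$quality.bucket)], pch = 19,
     xlab = "alcohol", ylab = "volatile.acidity")
legend("topright", legend = names(cols), col = cols, pch = 19)
```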
Multivariate analysis let me observe the relations between the properties more closely, and some of the relations found in the bivariate analysis have been confirmed:
volatile.acidity and density have a negative correlation with quality and alcohol
alcohol has a positive correlation with quality, citric.acid and sulphates
Yes, there were. First of all, the amount of alcohol in a wine seems to increase the quality; second, the good wines tend to have more sulphates, which changes my idea that ‘the more preservatives a wine has, the worse its quality’.
For this part of the study, I will transform the log variables back to their original scale, so that the values shown in the plots are the real ones.
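Since the log variables were created with log10(), the inverse is a power of ten. A minimal sketch of the round trip, using the first ten sulphates values from the str() output (the name `sulphates.back` exists only in this example):

```r
# Toy sample: first ten sulphates values from the str() output.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)

sulphates.log  <- log10(sulphates)    # forward transform used earlier
sulphates.back <- 10^sulphates.log    # inverse transform back to original units

# all.equal() compares with a numerical tolerance, as floating point requires.
print(all.equal(sulphates, sulphates.back))
```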
This plot is interesting because it brings together the main features that, according to my investigation, seem to influence wine quality the most.
citric.acid, sulphates.log and alcohol.
volatile.acidity, which is negatively correlated with quality.
This is one of the most important conclusions drawn from my investigation. The quality of wine gets better with higher levels of alcohol and lower volatile acidity.
The last plot shows the second important conclusion. The quality of wine gets better with higher levels of sulphates. I suppose that with sulphates, as a good preservative, the wine deteriorates more slowly than the more ‘natural’ ones. More sulphates and more alcohol, better wine.
The exploratory data analysis done in this project, and throughout the course, has been very useful not only for learning how to use R, but also for understanding the meaning of some statistical tools in real situations, in this case the quality of red wines as a function of their chemical properties. Tools like scatterplots and density plots helped me to present the long list of values in a simple and meaningful way.
The findings were surprising, like the positive relation of quality with alcohol and with the amount of sulphates. Before this study I was completely mistaken, because I thought that a ‘bad’ wine had a higher % of alcohol.
We need to keep in mind that this is a small sample of red wines, and that the quality variable is the subjective rating of an expert, with all the implications that this has… maybe a better solution would be to use the median rating of a group of experts.
Another consideration is that it would have been interesting to have the geolocation of each observation, so that we could compare wine quality with the designation of origin and plot it on a map. The result could be interesting…